CUBE CONNECT Edition Help

CUBE Cluster Introduction

A computer cluster is a group of loosely coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer. The components of a Cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability.

Clusters are typically configured for either high availability or high performance. High availability clusters are configured for redundancy and make use of redundant nodes to provide continuous service when primary nodes fail. This type of cluster would commonly be used for a Web server for high volume web sites where continuous access is in demand.

High-performance clusters increase performance by splitting a computational task across many different nodes in the cluster. These clusters are most common in scientific computing. High-performance clusters are optimal for workloads that require the processes on separate computer nodes to communicate actively during the computation. This includes computations where intermediate results from one node's calculations affect future calculations on other nodes.

CUBE Cluster, implemented in CUBE Voyager, allows you to create your own high-performance computer cluster to distribute the workload of your CUBE Voyager model run across multiple computing nodes. Each computing node consists of a single computer processor, so assembling your own computer cluster is as simple as networking several existing computers together on a local area network. Many office environments already have such local networks of computers in place, which can be utilized to support CUBE Cluster. Alternatively, dedicated machines could be connected together on a local network to form an independent modeling cluster for dedicated modeling work, separate from regular office computing. A third, potentially superior, alternative is to purchase a dedicated multiprocessor machine with each processor acting as a cluster node, so that all of the computation and data sharing is contained in one local machine and does not depend on your local network connections.

CUBE Cluster can be used to significantly reduce model run time by distributing steps in the model flow that are not dependent on one another and executing them simultaneously on distributed or parallel processing nodes. Most travel demand models have some steps that are choke points in the model flow: subsequent steps require the outputs of these steps before they can be run. Most travel demand models also have many steps that can be run at the same time if independent processing nodes are available. Some reconfiguration of the model flow may be required to group together the steps that can be run simultaneously.

CUBE's Application Manager flow chart view of your model provides a convenient tool for identifying the model steps in an application that can be distributed. Model step dependencies are easily visible from the data flows presented in the flow chart. An example is shown in the figure below. This is a fairly typical transit network development and skimming (level-of-service) process that might be present in many models. Transit network definitions and skim matrices are produced for two periods (peak and off-peak travel) and for two separate modes of access (drive and walk only). Each of these steps is independent of the others. Assume that each step runs in approximately 9 to 11 minutes. This implies a total run time for this group of about 40 minutes if run sequentially, as is the normal practice. If four processing nodes are available, so that each step is distributed to a separate node using CUBE Cluster, then all four steps would be executed simultaneously. The run time for the group would then be limited to approximately the time of the longest running individual step in the group, or about 11 minutes. This is a saving of 29 minutes, a reduction in run time for the group of about 72%. If this transit group is nested in a model feedback loop, as in many models, then the whole model process would save about 29 minutes for every iteration of the feedback loop.
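The savings arithmetic in the example above can be checked with a short calculation. The individual step times below are illustrative values consistent with the stated 9-to-11-minute range, not measured figures:

```python
# Illustrative run times (minutes) for the four independent transit
# steps from the example: two periods x two access modes.
step_times = [9, 10, 10, 11]

sequential = sum(step_times)    # steps run one after another
distributed = max(step_times)   # each step on its own cluster node
saving = sequential - distributed
reduction = saving / sequential

print(sequential, distributed, saving)   # 40 11 29
print(round(reduction * 100))            # ~72 (% reduction)
```

The distributed run time collapses to the longest single step, which is why grouping steps of similar duration gives the best utilization of the nodes.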

Typical transit skimming process: Period by access

Bentley undertook performance testing of CUBE Cluster using a recently completed client model, implemented in CUBE Voyager, that has a high model run time due to the size of the region and the complexity of the model structure. The test used readily available off-the-shelf computer hardware rented from a computer rental firm; this level of hardware was felt to be reasonably representative of the type of hardware that our typical clients would already have in place. Ten identical computer workstations were rented and loaded with the beta version of CUBE Voyager and CUBE Cluster. The results of the run-time test are shown in the figure below for varying numbers of processing nodes. Also shown are the theoretically ideal run times if all processors could be utilized at all times during the model runs. The difference between the two curves results from model steps that cannot be distributed and from increases in I/O time attributable to assembling results from multiple processing nodes back to a common working folder.

There are two forms of distributed processing available in CUBE Voyager:

  • Intrastep distributed processing (IDP)

    This type of distributed processing works by breaking up the zone-based processing within a single step into zone groups that can be processed concurrently on multiple computing nodes. Currently only the Matrix and Highway programs support IDP. IDP works for most types of Matrix (zone-based, not RECI) and Highway processing, as long as each zone can be processed independently (see Procedures that disable intrastep distributed processing for a discussion of the types of processes not supported by IDP).

  • Multistep distributed processing (MDP)

    This type of distributed processing works by breaking blocks of one or more modeling steps out of the main flow and distributing them to multiple computing nodes for processing. It can be used with any program in CUBE Voyager, as well as with user-written programs, with the caveat that the distributed blocks and the mainline process must be logically independent of each other. For example, you cannot run skimming procedures at the same time as, or before, updating the speeds in the network you intend to skim. However, in most models you can assign peak and off-peak period transit networks concurrently. Understanding these basic relationships and dependencies in a model is very important to implementing MDP successfully.
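The zone-group splitting that IDP performs can be pictured with a simple sketch. The partitioning below is purely conceptual (CUBE Voyager handles the actual distribution internally), and the zone and node counts are made up for illustration:

```python
def split_zones(num_zones, num_nodes):
    """Split zones 1..num_zones into contiguous groups, one per
    computing node, so each group can be processed concurrently
    (the idea behind intrastep distributed processing)."""
    base, extra = divmod(num_zones, num_nodes)
    groups, start = [], 1
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups

# e.g. a hypothetical 100-zone model on 3 nodes
for g in split_zones(100, 3):
    print(g.start, "-", g.stop - 1)   # 1-34, 35-67, 68-100
```

This is why IDP requires that each zone be processable independently: a node working on zones 35-67 never sees intermediate results for zones 1-34.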
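The logical-independence requirement for MDP amounts to grouping steps by their dependency level: steps in the same level share no input/output relationship and can run concurrently. The step names and dependencies below are a hypothetical model flow, used only to illustrate the idea:

```python
# Hypothetical model steps mapped to the steps they depend on.
deps = {
    "update_speeds": [],
    "skim_peak": ["update_speeds"],
    "skim_offpeak": ["update_speeds"],
    "mode_choice": ["skim_peak", "skim_offpeak"],
}

def concurrent_groups(deps):
    """Group steps into levels; steps within a level have all their
    inputs ready and could be distributed to separate nodes."""
    done, groups = set(), []
    while len(done) < len(deps):
        ready = sorted(s for s in deps if s not in done
                       and all(d in done for d in deps[s]))
        if not ready:
            raise ValueError("circular dependency in model flow")
        groups.append(ready)
        done.update(ready)
    return groups

print(concurrent_groups(deps))
# [['update_speeds'], ['skim_offpeak', 'skim_peak'], ['mode_choice']]
```

The two skimming steps land in the same level, matching the text's example: they can run concurrently, but only after the speed update, and mode choice must wait for both.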

CUBE Voyager uses a simple file-based signaling method for communication between the main process and the sub-processes. Because of the zone-independence requirement on IDP and the step-independence requirement on MDP, implementing this system requires careful planning and setup by the user. The main process checks whether a sub-process is available before assigning a task to it, and checks the sub-process run return code for errors. However, a crash on a sub-process computer will cause the main process to wait forever, and it will need to be manually terminated by the user on the main as well as the sub-process computers. This should not be a problem when used with models in "production" mode that should not have any syntax or logic errors.
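The general idea of file-based signaling can be sketched as follows. The file names and polling protocol here are illustrative assumptions, not CUBE Voyager's actual on-disk convention; the sub-process side is simulated so the sketch is self-contained:

```python
import os
import tempfile
import time

work = tempfile.mkdtemp()                  # stands in for the shared folder
task = os.path.join(work, "node1.task")    # hypothetical task file name
done = os.path.join(work, "node1.done")    # hypothetical completion flag

# Main process posts a task for a sub-process to pick up.
with open(task, "w") as f:
    f.write("run step 3")

# A sub-process would normally detect node1.task, run the step, and
# write node1.done with its return code; we simulate that here.
with open(done, "w") as f:
    f.write("0")   # return code 0 = success

# Main process polls the shared folder for the completion flag.
# (CUBE Voyager itself waits indefinitely; this sketch adds a timeout.)
deadline = time.time() + 5
while not os.path.exists(done):
    if time.time() > deadline:
        raise TimeoutError("sub-process never signaled completion")
    time.sleep(0.1)

rc = open(done).read().strip()
print("sub-process return code:", rc)
```

This also shows why a crashed sub-process hangs the main process: the completion file is simply never written, and polling never ends.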

For distributed processing to work, the main process and all the sub-processes must have access to a shared drive on a computer network so that they can share data. The main process and all the sub-processes must map the shared drive to the same drive letter (the Subst system command can be used to map a local drive to another drive letter that matches the other processes), and they must all start with the same working directory, unless the CommPath feature is used. This is because input and script files are referenced by drive letter and directory location during a run; if they are unavailable in that location, the run will fail.

Running on a network drive can significantly slow down a run for disk-intensive applications, depending on the computer network's throughput capacity, so for certain steps there may be little runtime benefit from DP, or a run may even take longer. Therefore, it is important to "tune" the DP setup to get the best performance. In general, DP works well for computation-intensive applications (for example, doing hundreds of matrix computations for each zone in a mode choice step) but yields less time savings for disk-intensive procedures (for example, combining three matrix files into one matrix file).
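One way to reason about this tuning is Amdahl's-law style arithmetic: the part of a step that is effectively serial (including shared-drive I/O) caps the achievable speedup. The serial fractions below are invented for illustration, not measurements of any CUBE step:

```python
def expected_speedup(serial_fraction, nodes):
    """Amdahl's law: only the parallelizable fraction of the work
    scales with the number of processing nodes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# Computation-heavy step: assume ~5% serial (e.g. setup + result merge).
print(round(expected_speedup(0.05, 4), 2))   # 3.48x on 4 nodes

# Disk-heavy step: assume ~60% effectively serial (shared-drive I/O).
print(round(expected_speedup(0.60, 4), 2))   # 1.43x on 4 nodes
```

Under these assumed fractions, the computation-heavy step gets most of the theoretical 4x benefit, while the disk-heavy step barely improves, which is the pattern the tuning advice above is pointing at.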